QVAC-3697: Load GGUF File From Buffer #1
base: temp-load-from-buffer
Conversation
Convert llama_file to a pure virtual class that can be overridden by multiple implementations (disk, single memory buffer, ...).
Define a new macro LLAMA_LOG_CMAKE_DEBUG that becomes a no-op in release builds. This gives us good tracing and debugging capabilities that will be especially useful for the async loading of multiple model shards.
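A minimal sketch of the idea, assuming a hypothetical compile definition named `LLAMA_CMAKE_DEBUG` set by CMake for debug builds (the PR's actual gating mechanism may differ); `LLAMA_LOG_DEBUG` is llama.cpp's existing internal logging macro:

```cpp
// Sketch only: compile the extra logging away in release builds.
// LLAMA_CMAKE_DEBUG is a hypothetical compile definition; the PR's
// actual flag name and mechanism may differ.
#ifdef LLAMA_CMAKE_DEBUG
    #define LLAMA_LOG_CMAKE_DEBUG(...) LLAMA_LOG_DEBUG(__VA_ARGS__)
#else
    #define LLAMA_LOG_CMAKE_DEBUG(...) ((void) 0)
#endif
```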
This change adds an additional automated test loading from disk, to ensure the existing functionality does not break.
The gguf-split utility now generates a `.txt` file listing all tensors. This is useful both for manual inspection/debugging and for incremental tensor loading, where it's not possible to know which tensors are present in other split files (that information is critical for handling optional tensors).
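For illustration, such a listing could be produced with the gguf API found in recent llama.cpp (a sketch, not the PR's actual implementation; the exact output format may differ):

```cpp
#include <cstdio>
#include "gguf.h"

// Sketch: dump every tensor name in a split to a text file, in the spirit
// of the *.tensor.txt summary described above.
static void dump_tensor_names(const gguf_context * ctx, const char * path) {
    FILE * f = fopen(path, "w");
    if (f == NULL) {
        return;
    }
    const int64_t n_tensors = gguf_get_n_tensors(ctx);
    for (int64_t i = 0; i < n_tensors; ++i) {
        fprintf(f, "%s\n", gguf_get_tensor_name(ctx, i));
    }
    fclose(f);
}
```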
I seem to lack permissions to add reviewers. It is in draft until I test it on a bare Addon, but the review of the Llama.cpp C++ code can start: @olyasir @olek-tether @gianni-cor @chetasr @yuranich @jpgaribotti
Force-pushed from 02227e3 to 0718c30
Updated tests to automatically skip based on the gguf filename (sharded or not) when running all tests at once.
Force-pushed from 5df4e25 to 52ed642
Un-drafting since I was able to run the JS integration test for the qwen3 LLM Addon without problems. The test can now use any dataloader implementation and will incrementally load the Llama.cpp model. See the successful log below.
We should not merge to master; it will make maintaining the fork more difficult. For example, we currently have another PR to merge from upstream to bring the fork up to date. We should create a differently named branch for our changes to the fork.
Can we do the following:
Fine with me. Please create a tether branch to merge the changes into, @yuranich.
I have a task in the Asana project to do this, but I don't know how easy it will be given the amount of changes. Maybe we can merge some of the commits.
temp-load-from-buffer
Force-pushed from 52ed642 to 85405d9
Force-pushed from 720734c to 45d84b8
Force-pushed to attempt to fix CI on some platforms; due to different compilers/configs it was failing on some of them.
Force-pushed from 4277f06 to 4d263be
- Ensures a char trait implementation for uint8 exists that can be used with std::basic_streambuf.
- Adds an implementation of std::basic_streambuf for a single vector. Will be used by llama.cpp and tests when loading from a single memory buffer.
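A minimal sketch of both pieces, using illustrative names (`uint8_char_traits`, `vector_streambuf`) rather than the PR's actual ones:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <cwchar>
#include <streambuf>
#include <vector>

// Minimal char_traits for uint8_t so std::basic_streambuf<uint8_t> can be
// instantiated (the standard only provides traits for char, wchar_t, etc.).
struct uint8_char_traits {
    using char_type  = uint8_t;
    using int_type   = int;
    using off_type   = std::streamoff;
    using pos_type   = std::streampos;
    using state_type = std::mbstate_t;

    static void assign(char_type & a, const char_type & b) { a = b; }
    static bool eq(char_type a, char_type b) { return a == b; }
    static bool lt(char_type a, char_type b) { return a <  b; }
    static int  compare(const char_type * a, const char_type * b, std::size_t n) {
        return std::memcmp(a, b, n);
    }
    static const char_type * find(const char_type * s, std::size_t n, const char_type & c) {
        return static_cast<const char_type *>(std::memchr(s, c, n));
    }
    static char_type * move(char_type * d, const char_type * s, std::size_t n) {
        return static_cast<char_type *>(std::memmove(d, s, n));
    }
    static char_type * copy(char_type * d, const char_type * s, std::size_t n) {
        return static_cast<char_type *>(std::memcpy(d, s, n));
    }
    static char_type * assign(char_type * d, std::size_t n, char_type c) {
        return static_cast<char_type *>(std::memset(d, c, n));
    }
    static int_type  to_int_type(char_type c) { return c; }
    static char_type to_char_type(int_type i) { return static_cast<char_type>(i); }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }
    static int_type eof() { return -1; }
    static int_type not_eof(int_type i) { return i == eof() ? 0 : i; }
};

// Read-only streambuf over a vector<uint8_t> that the caller keeps alive.
class vector_streambuf : public std::basic_streambuf<uint8_t, uint8_char_traits> {
public:
    explicit vector_streambuf(std::vector<uint8_t> & data) {
        uint8_t * begin = data.data();
        setg(begin, begin, begin + data.size());
    }
};
```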
Override the pure virtual interface with a class that can operate on a single memory buffer.
Auxiliary function to convert a list of C strings to a vector of C++ strings.
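Something along these lines (a sketch; the PR's actual name and signature may differ):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of the helper: copy a counted array of C strings into owned
// std::strings.
static std::vector<std::string> to_string_vector(const char ** strs, std::size_t count) {
    std::vector<std::string> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        out.emplace_back(strs[i]);
    }
    return out;
}
```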
Add new GGUF reader implementation that can read metadata from a memory buffer.
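By way of illustration, a buffer-based reader starts by validating the GGUF header directly in memory; the file magic is the four bytes "GGUF" followed by a 32-bit version (a sketch, not the PR's code):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: check that a memory buffer plausibly holds a GGUF file before
// parsing metadata out of it.
static bool buffer_has_gguf_magic(const std::vector<uint8_t> & buf) {
    if (buf.size() < 8) {  // magic (4 bytes) + version (4 bytes)
        return false;
    }
    return std::memcmp(buf.data(), "GGUF", 4) == 0;
}
```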
- Add code to be able to load a gguf file from a variant (memory or disk).
- Some structs simplify how to load a file and keep track of the pointers (which are now in the same struct).
Move the loader code that processes a file after it has been loaded into memory and populates the loader's own attributes into a reusable method.
Add a new C++ function to the Llama main header to load from a single memory buffer, and propagate changes to internal calls/constructors.
A file buffer that can be fulfilled using string keys. The extract method waits until the file is provided.
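Conceptually this behaves like a map of futures keyed by file name. A self-contained sketch under that assumption (names are illustrative, not the PR's):

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Sketch: fulfill() provides the payload for a key; extract() blocks until
// that key has been provided, then takes ownership of the payload.
template <typename Payload>
class keyed_future_map {
public:
    void fulfill(const std::string & key, Payload payload) {
        {
            std::lock_guard<std::mutex> lock(mutex);
            payloads.emplace(key, std::move(payload));
        }
        cv.notify_all();
    }

    Payload extract(const std::string & key) {
        std::unique_lock<std::mutex> lock(mutex);
        cv.wait(lock, [&] { return payloads.count(key) != 0; });
        Payload payload = std::move(payloads.at(key));
        payloads.erase(key);
        return payload;
    }

private:
    std::mutex mutex;
    std::condition_variable cv;
    std::map<std::string, Payload> payloads;
};
```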
Handles the logic for incrementally loading files and tensors in model shards.
Refactor backend buffer creation (for model loading) into functions.
- The function now takes size_data instead of the member attribute.
- Sanity checks of file pointer handles.

These two changes will be useful when calling `load_all_data` multiple times during incremental shard load.
Adapt the loader and model load to incrementally load files and upload tensors.
Add functions to Llama.cpp public headers to asynchronously load shards.
Split out some common loading functionality. This will help in the memory loading tests.
Add a submodule with re-usable code for tests.
Adapt embedding example to showcase how to load from memory. Can be configured through environment variables.
Adapt simple example to showcase how to load from memory. Can be configured with environment variables. Qwen3, for example, can be used with the simple example.
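For example, the examples could pick up their configuration like this (a sketch; the actual environment variable names used by the PR may differ):

```cpp
#include <cstdlib>
#include <string>

// Sketch: pick the model source from the environment, with a fallback.
// MODEL_PATH is a hypothetical variable name, not necessarily the PR's.
static std::string model_path_from_env() {
    const char * path = std::getenv("MODEL_PATH");
    return path != nullptr ? std::string(path) : std::string("model.gguf");
}
```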
Add some automatic tests that load from memory (single buffer or multiple async splits).
Force-pushed from 4d263be to cd1b485
Most CI pipelines pass now. Some target failures seem unrelated.
@jpgaribotti @yuranich Can you suggest what to do with the remaining failing CI pipelines? They seem to be due to unrelated issues, for example:
Is it okay to proceed with the review as it is?
This pull request makes changes to Llama.cpp in order to be able to load models directly from memory. It is intended to be reviewable commit by commit; individual commits contain a longer text description below the header.
Tested that it works properly from a bare Addon (LLM repo). See #1 (comment)
In particular, this PR exposes:
- `llama-cpp.h`: `llama_model_load_from_buffer(vector<uint8_t>&& data, ...)`, to load from a single buffer containing a .gguf file's contents.
- `llama.h`: `llama_model_load_from_split_futures(char** paths, ...)` and `llama-cpp.h`: `llama_model_load_fulfill_split_future(char* path, ..., unique_ptr<basic_streambuf<uint8_t>>&& streambuf)`, which together allow a model to be loaded asynchronously/incrementally, uploading its tensors to the backend storage while host memory is being released.
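A rough usage sketch for the single-buffer entry point, assuming the remaining parameters mirror `llama_model_load_from_file` (the exact signature and return type are defined by the PR, not here):

```cpp
#include <fstream>
#include <iterator>
#include <utility>
#include <vector>

#include "llama-cpp.h"

int main() {
    // Read the whole .gguf file into memory; in real use the bytes might
    // arrive from the network or another data loader instead.
    std::ifstream in("model.gguf", std::ios::binary);
    std::vector<uint8_t> data((std::istreambuf_iterator<char>(in)),
                              std::istreambuf_iterator<char>());

    // Hand the buffer to llama.cpp instead of a file path. The params
    // argument is assumed here; check the actual declaration in llama-cpp.h.
    llama_model_params params = llama_model_default_params();
    llama_model * model = llama_model_load_from_buffer(std::move(data), params);
    if (model == nullptr) {
        return 1;
    }

    // ... create a context and run inference as usual ...

    llama_model_free(model);
    return 0;
}
```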
How to run the code?
Build and prepare model
Build (e.g. in release mode) Llama.cpp including the examples, tests and tools:
Generate a sharded model and its `*.tensor.txt` summary file:
Automated tests
Run automated tests for a single `gguf` file:
Run automated tests for a sharded model:
Or simply run all tests:
Should output:
Examples
Demo video: https://drive.google.com/file/d/1mjqecwJ1LFYUNofr4wIdPFK9IkUxbHZh/view?usp=sharing
Set up the environment:
Run example with Qwen3:
Outputs:
Run example with GTE:
Related PRs
Asana task: https://app.asana.com/1/45238840754660/project/1210873391319186/task/1210877463428607